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Abstract 

The adoption of probabilistic models for the best individuals found so far is a powerful approach for 
evolutionary computation. Increasingly more complex models have been used by estimation of distri- 
bution algorithms (ED As), which often result better effectiveness on finding the global optima for hard 
optimization problems. Supervised and unsupervised learning ofBayesian networks are very effective op- 
tions, since those models are able to capture interactions of high order among the variables of a problem. 
Diversity preservation, through niching techniques, has also shown to be very important to allow the 
identification of the problem structure as much as for keeping several global optima. Recently, clustering 
was evaluated as an effective niching technique for ED As, but the performance of simpler low-order EDAs 
was not shown to be much improved by clustering, except for some simple multimodal problems. This 
work proposes and evaluates a combination operator guided by a measure from information theory which 
allows a clustered low-order EDA to effectively solve a comprehensive range of benchmark optimization 
problems. 

Keywords: clustering, evolutionary computation, linkage learning, niching. 



1 Introduction 

Evolutionary algorithms solve optimization problems by evolving successive populations of so- 
lutions until convergence occurs. Two steps are usually present at each generation: selection of 
promising solutions and creation of new solutions in order to obtain a new population. 

Combination of genetic information is a major concern in evolutionary computation. In the 
simple genetic algorithm (sGA) [1] this mechanism is implemented as the crossover operator, 
which creates a new individual from two parents by combining portions of both strings. Recently, 
estimation of distribution algorithms (EDAs) started a novel approach for learning information 
from the best individuals, which involves inferring a probabilistic model and sampling from this 
model in order to generate the next population. Combination of information is achieved in EDAs 
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since a single model is built from several good individuals. Unfortunately, combining different 
individuals may lead to poor results if the model adopted is not expressive enough. 

Simpler order-1 EDAs adopt probabilistic models which assume independence among genes. 
This class of EDA is known for its simplicity and computational efficiency in model learning, 
since no search for model structure needs to be performed [2]. Further, the simple conception 
and implementation should make those algorithms very attractive. Their low effectiveness on 
harder benchmark problems, however, is unacceptable. This is a major drawback, since genetic 
and evolutionary algorithms are known for their wide applicability and robustness [3]. 

High order EDAs, on the other hand, are based on learning the linkage among genes by in- 
ferring expressive probabilistic models based on searching for a factorization, which captures 
the dependencies among genes. Good results are reported for several problems in the literature 
whereas a high computational cost associated to the model induction stage is imposed in this class 
of EDAs. Finding a factorization can be a computationally expensive process and the resulting 
graph is often a suboptimal solution [4] [5]. 

One of the most important efforts to make EDAs more effective is to adopt clustering as a 
strong niching approach, inducing the preservation of diversity in the population. Niching is 
crutial for evolutionary computation in general [6] and for EDAs in particular [7]. It improves 
the identification of the problem structure as much as enhances the chances of finding a higher 
number of global optima on multimodal problems. K-means clustering algorithm [8] has recently 
been applied as a niching technique based on grouping genotypically similar solutions together. 
The performance of simpler low-order EDAs, however, was not show to be much improved by 
clustering except for some simple unstructured multimodal problems. Low-order clustered EDAs 
have not been able to solve hard deceptive structured problems [9]. 

The main contribution of this paper is to show that a simple low-order EDA aided by cluster- 
ing the population and guided by information measures is able to perform linkage learning and, 
therefore, solve a representative set of benchmark problems. This work extends a previous paper 
[10], where some of the ideas and results reported here were first presented. Here the foundations 
of the algorithm and operators proposed are discussed in a more detailed fashion, and a wider 
set of experiments is presented. 

The main difference between the new approach and other similar clustered low-order EDAs 
is on how to deal with the combination of information. Clustering, or similar unsupervised learn- 
ing approaches, are usually applied to the population of EDAs in order to prevent combination of 
different niches (clusters). Conversely, the new operator trusts in the combination of information 
from different clusters as an effective method for exploring the search space. This combination, 
however, must be carefully performed, respecting the information carried out by each cluster. 
The new operator captures relevant information about the problem structure from the probabilis- 
tic models of the clusters and builds a combined probabilistic model from two different parent 
clusters, attempting to maintain the most informative part of each parent intact, hence preserving 
the problem structure. 

Experiments illustrate the effectiveness of this combination operator on detecting the problem 
structure. When the operator is turned off or replaced by a random recombination, the algorithm 
is not able to find the global optima of any benchmark problem tested. The problems adopted 
here illustrate several aspects which have been recently considered as tricky for EDAs, such as 
deception [11], symmetry [9], hierarchy [12], global multimodality [3] and the presence of over- 
lapping building blocks [13]. The structured fashion of those and other classes of problems makes 
them hard for low-order and even for high-order EDAs. 
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This work is organized as follows. Section 2 discusses the main concepts related to the field of 
estimation of distribution algorithms. Section 3 reviews the most closely related work concerning 
to the application of clustering in estimation of distribution algorithms. A discussion about how 
a generic clustered order 1 EDA could detect higher order of interactions is presented in Section 
4, where a new operator guided by information measures is proposed. 

Section 5 presents the algorithm </?-PBIL, which implements the ideas discussed in Section 
4. A default parameter setting for the algorithm is discussed in Section 6. Section 7 describes the 
benchmark optimization problems which were adopted in this paper. Section 8 presents empirical 
verifications of the performance of t^-PBIL, comparing it to state-of-the-art compentent EDAs. 
Finally Section 9 discusses results and implications of this work. 

2 Linkage Learning and Estimation of Distribution Algorithms 

Interacting variables are more informative together than alone. Interaction among variables is an 
irreducible whole; a dependency which cannot be broken [14]. A classic example from machine 
learning literature related to the importance of interaction among variables is the XOR problem 
[15] which is not linearly separable and cannot be solved without capturing interactions. 

In genetic and evolutionary computation, the identification and preservation of important 
interactions among genes is called linkage learning and this topic gained increasing attention from 
the community. Most linkage learning techniques aim at the identification of substructures which 
should be conserved during combination [16]. A similar concept in genomics, called genetic 
linkage, is defined as the association of genes on the same chromosome. When two genes are 
independent, the Mendelian law of independent assortment states that the segregation of one 
gene is independent of the segregation of the other [17]. 

Simple genetic algorithm (sGA) with one-point crossover relies on the ordering of the genes in 
the codification of the problem. In order to achieve success it is required that interacting variables 
should be coded as nearby genes in the chromosome in such a way that crossover would less 
probably cause the disruption of substructures. More recently other approaches adopted alterna- 
tive schemes ranging from the simple reordering of the genes to more complex mechanisms like 
subspecification and superspecification of solutions, as in Messy GA [18]. This algorithm adopts 
a two-stage evolutionary process where substructures are identified in a first stage and subse- 
quently combined. This ensures that substructures were all correctly identified before going on 
and exploring the combinations among them. 

A different situation occurs in Estimation of Distribution Algorithms (EDAs), where proba- 
bilistic modeling is applied as the learning mechanism for the evolutionary process. Algorithm 1 
is the general description of an EDA. 



Algorithm 1 A general estimation of distribution algorithm 
Generate an initial random population. 
Compute the fitness of the individuals, 
while convergence criteria are not met do 

Build a probabilistic model from the best individuals in the population. 

Sample from this model in order to breed new individuals and update the population, 
end while 
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Figure 1: A Bayesian network structure learned during the evolutionary process of an EDA for 
the concatenated trap-4 problem. An edge represents dependence among variables. 

An EDA solves a problem by building probabilistic models of the best solutions from the 
population. New solutions (individuals) are sampled from the model, so the traditional operators 
of crossover and mutation, which operate at the level of the individuals, are not suited for EDAs. 

The most effective EDAs adopt fairly complex models which should capture the whole struc- 
ture of dependence among the genes of the problem. The major motivation for learning complex 
and powerful models is to acquire and maintain information about interactions among the vari- 
ables of the problem. Some earlier EDAs, however, learn probabilistic models which assume 
independence among all genes, as 

m 

n(x) = l[n l (x i ) (1) 

»=i 

where tt is the joint probabilistic model 

x is the binary vector of an individual 
TTi(x t ) is the marginal model of gene i 

For a binary codification, the marginal model for each gene is the binomial proportion of Is 
in that gene for the selected individuals. Algorithms based on this kind of model are not able 
to detect any interactions among the genes without additional support, therefore they present a 
similar behavior as a GA with uniform crossover [2]. Since no interactions are considered and 
only order 1 statistics are used, those algorithms should be called order 1 EDAs. One of the most 
important member of this class is PBIL (Population Based Incremental Learning) [19]. 

In PBIL the population is represented by a probability vector p = (p 1; v p , . . . ,p m ), as in (1), 
where pj represents the probability of an individual to possess a 1 in gene j. At each generation, 
M individuals are generated and the best N are selected to update the model. 

On the other hand, an EDA based on Bayesian networks can capture the interactions among 
the genes: 

m 

n ( x ) = Y[' IT i( x i\P a i) ( 2 ) 
i=l 
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where tt is the joint probabilistic model 

x is the binary vector of an individual 
Wi(xi) is the marginal model of gene i 
r pa i is the set of parents of x t 

The set pa i comprises the variables on which xi depends. This factorization is only possible 
because, for each i, x { is assumed to be independent of its nondescendants, given its parents 
pa { [20]. Figure 2 illustrates two possible factorization assumptions and figure 1 illustrates a 
Bayesian network learned at some stage of an evolutionary process for the concatenated trap-4 
problem [11] with 4 subproblems. EDAs based on higher order statistics (typically higher than 
order 2) should be called higher-order EDAs, in contrast to low-order EDAs which use statistical 
models where no interactions or, at most, only pairwise interactions are considered. Two very 
representative higher-order EDAs which adopt Bayesian Networks are the Bayesian Optimization 
Algorithm (BOA) [11] and the Estimation of Bayesian Network Algorithm (EBNA) [21]. The 
difference among them is on the metric used to evaluate factorizations when searching for the 
model structure. 



3 Related work 

It was already shown that clustering improves the performance of some EDAs for certain classes 
of problems. For globally multimodal problems, for instance, clustering has revealed a promising 
approach [3]. Such problems present several global optima or, in other words, several optimal 
solutions with the same fitness. Those problems may be very tricky for an evolutionary algo- 
rithm to solve, since slow convergence to one of the optima (genetic drift) often occurs. The 
retarded convergence is explained by the combination of solutions coming from different regions 
of the search space, which often results in poor solutions. The application of clustering in EDAs 
when solving globally multimodal problems can be described as the separation of the population 
in subpopulations (one for each cluster), and the subsequent learning of different probabilistic 
models for each subpopulation. Breeding generally involves no combination among different 
subpopulations. Clustering improves the performance of the evolutionary algorithm by such an 
interbreeding avoidance. However, this approach leads to a smaller exploration of the space, 
since combination from distant regions could be favorable. 

Clustering is adopted, for the first time, as the niching technique of an EDA in [9]. Only bi- 
nary codifications are considered, and k-means clustering was applied to obtain subpopulations 
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Figure 2: (a) the model of independence among variables, as adopted by PBIL [19] and cGA [2] (b) 
the model adopted by higher-order EDAs, which assumes that some dependence structure exists 
and should be inferred from the selected individuals. 
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at each generation. Representative EDAs and widely known globally multimodal test problems 
were used in the experiments. The most noticeable result is the major improvement in effec- 
tiveness and convergence speed of UMDA - a single-order EDA - when clustering is applied. 
The clustered UMDA outperforms a simple UMDA for multimodal problems without a complex 
structure like twomax. The number of global optima found and stably maintained is increased 
with clustering. Those results are explained by the speciation of each cluster in a different peak. 

However, for more structured problems, clustering does not help UMDA very much. Ac- 
tually, the literature does not report any low-order EDA which is able to solve hard deceptive 
structured problems, even when aided by clustering. Single-order clustered EDAs are only ex- 
pected to perform well on globally multimodal optimization problems because combination of 
solutions from different basins is avoided [3]. Actually, the beneficial effect claimed by [9] is to 
control combination among different subpopulations since interbreeding is considered as harm- 
ful for the evolutionary process. Exploration of the search space is done within each cluster by an 
evolutionary algorithm in a nearly parallel fashion. 

The authors of [9] conclude by stating that more sophisticated operators, besides just cluster- 
ing, should be used for an efficient scale-up behavior on low-order EDAs on difficult problems. 

A similar approach to clustering is the parallelization of EDAs by adopting multiple subpopu- 
lations and migration. In [22], a simple recombination operator called PV-wise uniform crossover 
is proposed, which is similar to GA uniform crossover. After two parent PVs were selected for 
combination a new temporary PV is built by randomly selecting, for each gene, from which of the 
two parents should the binomial proportion be taken from. A new individual is sampled from 
this PV. 

It is possible to derive a general framework for a very simple EDA based on single order statis- 
tics and aided by some sort of clustering, as in [9]. Incrementally, or at each generation, the best 
individuals would be clustered by similarity of their genotype and, for each cluster, independent 
models for each gene would be learned. Assuming a fixed number k of clusters and a fixed num- 
ber p of genes, a total of p ■ k independent binomial proportions would be kept updated during 
the evolutionary process. New individuals would be generated by sampling from the probability 
vector of a single cluster, or combination among clusters (interbreeding), similarly as in [22]. 

Another perspective for the application of clustering in EDAs can be found in [3]. The unsu- 
pervised estimation of Bayesian network algorithm (UEBNA) uses a model-based approach for 
clustering, based on the unsupervised learning of Bayesian networks at each generation. An un- 
observed variable C is included in the model, which represents the unknown cluster label. The 



(a) dependence among variables and (b) dependence only upon a 

upon a cluster label cluster label 




Figure 3: (a) the model adopted by UEBNA, including a cluster label in the structure (b) the 
model adopted by a clustered order 1 EDA with the concept-guided combination. All variables 
are assumed to be independent, and the only dependence is upon the cluster label C. This model 
is adopted by the </?-PBIL algorithm 
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Bayesian network represents a joint probabilistic model for the best individuals, considering a 
factorization which includes all genes and the cluster random variable C. 

The structure of a Bayesian network describes a factorization as a directed acyclic graph, as il- 
lustrated in figure 2b. The nodes denote random variables and the edges represent dependencies 
among those variables. Additionally, a set of parameters of the network specify conditional dis- 
tributions given the possible values of the corresponding parents pa i of each variable i. UEBNA 
adopts unsupervised learning of Bayesian networks since the cluster variable is unspecified. The 
structure of this kind of Bayesian network is illustrated in figure 3a. 

The algorithm is expected to detect the correct setting of diverse global optima to diverse 
subpopulations and capture dependencies among the variables of the problem through learning 
Bayesian networks at successive generations. Experimental validation confirms the effectiveness 
of the algorithm for a set of globally multimodal structured problems. The most relevant prob- 
lem tested was graph bisection, which is a hard problem even for higher-order EDAs like EBNA 
or BOA. Actually, UEBNA outperforms EBNA achieving a higher number of global optima es- 
pecially for greater instances of graph bisection, which validates the relevance of clustering for 
evolutionary computation. 

Unfortunately, the Bayesian network-based model adopted by EBNA, UEBNA, BOA and 
other higher-order EDAs requires a computationally complex learning stage, which must be run 
at each generation. Actually, finding the best structure of a Bayesian network is known to be a 
NP-hard optimization problem itself. In practice, a greedy algorithm is executed to perform the 
search, which can be based on penalized maximum likelihood. Penalization avoids excessively 
complex network structures. 

It was already recognized that learning the structure of a Bayesian network at each generation 
may become a bottleneck for EDAs and some attempts to overcome this problem have been pro- 
posed [4]. On the other hand, incorporating even more elements into the model, as in [3] where 
clustering labels are used, seems to be worthy and should be considered, but this approach results 
models of higher complexity. 

It would be very worthy, however, if a simpler EDA could reach similar performance as more 
complex EDAs do. As suggested by [9], clustering itself does not solve the problem, but further 
mechanisms could also be usefully incorporated. In the next sections a combination operator 
is evaluated, which allows a low-order clustered EDA to achieve success on the linkage learn- 
ing task. This operator explores information gained from clustering itself and from single order 
statistics from the subpopulations. The proposed approach allows an effective exploration of 
the search space by combining the most relevant information. Experiments illustrate that the 
algorithm preserves the structure of complex problems which present high interactions among 
variables, despite only adopting statistics of low order. 

Two major motivations for the application of clustering in EDAs can be derived form the 
works reviewed above: (i) to increase preservation of the diversity in the population and (ii) 
to avoid combination from different peaks in multimodal optimization. Another motivation for 
clustering is (iii) to allow the identification of a mixture of distributions in continuous optimiza- 
tion. This last motivation is often described in the literature [23] [7], but it is not directly related to 
the work herein presented. 

The relevance of motivation (i) is clear since diversity maintenance prevents premature con- 
vergence to local optima and allows for a proper exploration of the search space. This topic is 
related to the niching methods of simple genetic algorithms, and was widely studied. Motivation 
(ii) is, however, conflictive with the first. The combination of information from distant regions is 
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a strong mechanism for exploring the space, since it could potentially result in better solutions. 
Even combination from different local optima is useful, since those optima may result from dif- 
ferent substructures, which were already found and should be combined. However, this aspect 
was not well studied yet and is explored later in this paper. 

4 Using an information measure to guide combination 

Low-order EDAs have not been effective on solving several classes of problems, particularity 
when high order of interactions among the variables occur. Clustering was applied as a niching 
technique for EDAs, as reviewed in Section 3. Individuals are grouped by similarity of their geno- 
type, and combination among different groups is usually avoided. However, the performance of 
simple low-order EDAs was not much increased by using clustering. 

One of the reasons which motivate the application of niching is to avoid premature conver- 
gence to local optima. Diversity preservation, however, is also important even when a single 
optimum exists [6] since substructures should be explored and recombined before the global op- 
tima is correctly identified. 

The existence of substructures in the problem is also an important factor which motivates nich- 
ing. When an evolutionary algorithm is relatively away from convergence, we can conjecture that 
the main source of diversity in the population is due to the diversity of substructures, since global 
optima are not clear yet. Thus, at some stage of the process, a clustering algorithm would group 
together individuals possessing the same substructures or, in other words, different substructures 
would characterize each subpopulation. 

The concepts which characterize each cluster are, thus, relevant information which should be 
explored. Actually, the combination of important parts of promising solutions found so far should 
guide GAs and EDAs during the exploration of the search space [9]. Therefore, combination from 
different clusters should not be avoided, but carefully performed. 

This Section describes a combination mechanism for clustered EDAs which carefully chooses 
how to combine information from clusters. Section 3 reviewed a random naive combination 
mechanism called PV-wise crossover [22]. However, it is shown later that this operator is not 
effective for the exploration of the search space and is not able to combine valuable information 
from parents, similarly as the GA with uniform crossover. 

A more sophisticated but still very computationally efficient operator can be built using ideas 
from information theory. It also combines information from different clusters by building a tem- 
porary PV from two randomly selected parent PVs. The difference is on how to choose from 
which parent to take each binomial proportion. A measure from information theory guides the 
choice of the best parent for each gene. It aims for a careful combination of relevant information 
from two PVs, attempting to maintain the most informative part of each parent. This operator is 
called concept-guided combination or, for brevity, the cg-combination. 

During a cg-combination, two parent clusters - A and B - are, randomly and proportionally 
to the mean fitness, selected. A temporary PV, from which a single new individual will be created, 
is obtained by taking proportions for each position j from either ttaj or ttbj- A measure from 
information theory is used to guide that choice in order to select always the most informative 
parent for each gene. Let 
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Figure 4: A possible setting of binomial proportions 7r,.jS and their respective u>ijs at some stage 
of the evolutionary process for a certain artificial problem. The values inscribed are the TTijS, while 
the grey tones denote the Wi.j in a scale ranging from black (minimum) to white (maximum). 



be the entropy of the distribution of gene j, where pj q = P(xj = q) is the proportion of individu- 
als possessing the value q for gene j in the whole population (q 6 {0, 1}). Similarly let 

Kj = ~ Pi,U l °92(Pi,j,q) ( 4 ) 

?e{o,i} 

be the entropy of the distribution of the same gene j without taking cluster i into account in the 
estimated proportions. The proportion p- J(J is computed as P(xj = q | c(x) 6 {1, 2, . . . , k} — {«}). 

The measure of how informative is a cluster i to a gene j is therefore the difference in the 
entropy of the distribution of the gene j before and after observing cluster i: 

u>i,j = hij - h' itj (5) 

Thus the decision of how to build the temporary PV becomes simple: choose, for each gene j, 
the parent with the greatest my among all is. Each position Vj of the temporary PV v is defined 
as 

v . _ \^A,j, ifu>A,j > U>B,j ^ 
\^B,j; otherwise 

After computation of v a new individual is generated by sampling from each position of v 
independently. Figure 4 shows a scenario with k = 3 clusters and 12 variables. Figure 5 illustrates 
the creation of v after an interbreeding for a small problem with four variables. 

The cg-combination relies that clustering could group together individuals possessing the 
same substructure, and attepts to extract the information about that substructure from W and 
II. A given cluster is expected to be informative for all genes which define a building block if 
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Figure 5: Generating a temporary PV during an interbreeding operation, tta.., kb,. and their 
respective u>ijs are shown using the same color convention as in figure 4 (except for the resulting 
PV which has no WijS). 
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most individuals in that cluster possess that building block. Overlapping building blocks do not 
represent a problem: two clusters would be informative to the same gene and this would still 
preserve substructures. This behavior is illustrated in Section 8. 

In addition, high-fitness similar individuals resulting form combinations of lower-level build- 
ing blocks would also be grouped together. Thus, relevant substructures are combined as the 
process goes on. 

The next Section presents a low-order EDA that effectively adopts the cg-combination as the 
interbreeding mechanism. 



5 (^-PBIL: A Clustering-based Evolutionary Algorithm 

The previous Section presented an operator called cg-combination, which supports the combina- 
tion of information from different clusters and, potentially, allows a low-order clustered EDA to 
effectively perform linkage learning. This operator can be used in the "interbreed" step. Each 
cluster defines a probabilistic model and genes are assumed conditionally independent given the 
cluster label, therefore only dependencies between a gene and the cluster label are considered. 
The resulting model is, therefore, limited to order 2 statistics. 

The algorithm i^-PBIL (Concept-guided Population-Based Incremental Learning) [10] is a low- 
order EDA which follows the proposed approach and performs interbreeding. Algorithm 2 shows 
the pseudocode of </?-PBIL. 

^-PBIL follows an incremental architecture where a single individual is generated at each 
iteration without the adoption of a succession of "generations" as other EDAs often do. A fixed 
number of k clusters are maintained and continuously updated. Whenever a new individual is 
generated, it replaces the worst individual in the population (if the new individual is not even 
worse) and, subsequently, clustering hypothesis and probabilistic models for each cluster are just 
updated through a single typical k-means step, and not fully relearned. Each cluster defines a 
subpopulation and, since only binary variables are allowed, then the probabilistic models for the 
subpopulations are just the corresponding binomial proportions A = (ftij) which denote the 
proportions of individuals with the value 1 for each gene j on each cluster i. 

Sampling from one of the probability vectors (PVs) will generate a new individual, similarly 
to other clustered EDAs. A PV is chosen randomly and proportionally to the mean fitness of the 
individuals of each corresponding cluster. 

Sampling from a single PV, however, is not the only option for generating a new individ- 
ual in (yS-PBIL, since the combination of PVs is also considered. Actually, the cg-combination is 
chosen as the default interbreeding mechanism for i^-PBIL. Two interbreeding mechanisms were 
discussed in the previous Section. Both propose to combine two PVs, obtain a temporary PV v 
and sample from v to generate a new individual. Experiments in the next sections illustrate how 
the PV uniform crossover fails on combining relevant information from two clusters, while the 
cg-combination succeeds. 

Two features were added in order to increase the performance of the algorithm. The first 
feature was motivated by the need for local search. The maximum likelihood estimator (MLE) of 
a binomial proportion is the simple mean of successes. In our case, it should be interpreted as the 
proportion of ones at each gene, or 
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Algorithm 2 </?-PBIL 



//Initialization: 

Generate an initial random population of size JV . 
Compute the fitness of the individuals. 
Select only the N W (N W < N ) best. 
// Learning: 

Learn clusters from the population. 

Compute a probability vector (PV) of binomial proportions for each cluster. 

Store the PV's in the matrix II = (nij). 

Compute the matrix of information measures W = (u>i.j). 

while convergence criteria are not met do 

// Breed a new individual H: 

Create H by randomly choosing from one of the two following possible procedures: 
(i) sample from one of the PVs 
or (ii) apply interbreeding, guided by the matrix 
W, and sample from the combined PV. 

//Selection: 

Compute the fitness of the new individual H. 

if H is not worse than the worst in population then 

Delete this worst individual. 

Insert H in the population, 
end if 
//Learning: 

Update clusters and matrixes n and W. 
end while 



where J2 c (x)=i x i ^ s tne num ber of individuals in cluster i possessing the value 1 at gene j and 
Hi is the number of individuals in cluster i. Unfortunately when all individuals in a cluster have 
the same value (all O's, or all l's) in a certain locus, this estimator saturates at one of the extremes 
(0% or 100%). Sampling from those extreme proportions wipes out the chance of the alternative 
value, 1 or 0, to be generated and this behavior is harmful for local search. 

A perturbation mechanism was added which changes slightly the binomial proportions esti- 
mated and, therefore, allows for an allele to be generated even if all individual possess the com- 
plementary allele. The Wilson estimator reviewed in [24] incorporates a degree of uncertainty by 
estimating the binomial proportion as 

J2c(x)=i X 3 + 2 . . 

^ = n l + 4 (8) 

is used instead of (7), which is the maximum likelihood estimator (sample mean). 

Thus, for instance, even if all 100 individuals in a cluster possess the value 1 in a locus there 
still remains a probability of 2% of a new individual to be generated with a in that same locus. 
Some kind of mutation operator could also be applied, probably with similar effects on local 
search. Bayesian inference could also be applied since it properly express the uncertainty about 
the value of the parameter through the specification of a prior distribution for the parameter. 
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Table 1: t^-PBIL parameters 



Parameter 


Description 


Default 
value 


No 


initial population size 




N w 


working population size 




k 


number of clusters 




Pc 


probability of interbreeding 


50% 


Paid 


probability of an old 
clustering hyposthesis to be used 


50% 


Pw 


probability of the Wilson 
estimator to be used 


50% 



Another feature was added in order to improve recombination for overlapping and hierarchi- 
cal building blocks. When smaller BBs start blending, new BBs emerge. This causes the loss of 
older clustering hypotheses, consequently inhibiting those initial BBs in recent individuals. This 
is a consequence of the well known race between selection and innovation [25]. For overlapping 
and / or hierarchical building blocks, this constitutes a great problem. A potential solution is the 
maintenance of some old BBs (therefore, old clustering hypotheses), which can be used to gener- 
ate new individuals. Both the current and an old clustering hypothesis compete when breeding. 
This mechanism works as follows: both W = (liij) and ft = (7^) matrixes are stored, for some 
old well-performing clustering hypotheses. The cg-combination selects randomly from one of 
two clustering hypotheses: the old set {fl id, W u} and the new, recently updated set {II, W}. It 
also computes performance records about both sets: if the new individual, generated by one of 
the sets, is selected, then the performance variable of the corresponding set is updated (increased 
by one). If the new set overperforms the old one, then the old set and its performance variable are 
updated, and the performance variable of the new set is set to zero. 

All parameters for 93-PBIL are described in table 1. Some default values for most parameters 
were set after empirical investigation, which is shown in Section 6. p c (probability of interbreed- 
ing), p oM (probability of an old clustering hypothesis to be used during breeding instead of the 
current one) and p w (probability of the Wilson estimator to be used instead of the sample mean) 
are all set to 50% by default. 

Termination criterion of the algorithm was set to be the loss of diversity inside PVs: the algo- 
rithm finishes when all TtijS saturate (reach some value above 0.95 or below 0.05). This is said to 
be the condition for the convergence. 

The user must set the values for some parameters for which no default values are provided: 
initial population size N a , working population size N w and the number of clusters k. The working 
population exists after the initialization and is created from the best individuals of the initial 
population as described in algorithm 2. 

The next Section describes an empirical investigation performed to find a default parameter 
setting for the algorithm. 
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Figure 6: Empirical evaluation of the sensitivity of t^-PBIL to three parameters p c , p Q u and p w/ 
reporting the mean number of global optima found and maintained and the mean number of 
fitness evaluations until convergence. Five values for each parameter (0%, 25%, 50%, 75% and 
100%) are considered on each row. Assume p c = 50%, p w = 50% and p Q u = 50% where the value 
of the parameter is not explicit. Each point in the graph represents the mean of 10 independent 
runs of the algorithm (replications). Curves result from spline interpolation. When the mean 
number of fitness evaluations exceeds 100,000 the respective line segment in the graph for the 
number of global optima is truncated. 
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6 Parameter setting 



This Section shows an exploration in the parameter space using three representative test prob- 
lems: shuffled HIFF [12], concatenated trap-5 [11] and graph bisection [3]. The experiment aims 
to find default values for some parameters of </?-PBIL. We start from a hypothesis that p c , p Q id 
and p w should be all set to 50%. The problem instances, which are descrbed in the next sec- 
tion) are Pshuff64, Ptrapfive50 and Pcatring42, all three relatively small (64, 50 and 42 variables, 
respectively), but are shown to be enough for revealing an influence of the parameters on the per- 
formance of the algorithm. Pcatring42 and Pshuff64 are globally multimodal, possessing 6 and 2 
global optima respectively, while Ptrapfive50 has a single global optimum. 

The other three parameters are maintained fixed for all runs of each problem in this experi- 
ment: N Q = 2, 500, N w = 250 and k = 10 for Pcatring42 and N a = 3, 000, N w = 300 and k = 15 for 
Pshuff64 and Ptrapfive50. Those values were previously verified to be near the minimal values 
at which the algorithm becomes able to find all global solutions for each of the three instances 
considered here in this evaluation. 

Figure 6 shows the results of this investigation. Each graph shows the variation of one of the 
three parameters considered, where the other two stay fixed. When nothing else is stated, the 
value for the remaining two considered parameters is 50%. 

Changing one of the parameters can dramatically affect the behavior of the algorithm for one 
or more of the problems. Setting p^ to 0% or 100% causes a reduction on the performance on 
Pshuff64 since both old and new clustering hypothesis store information about the subsequent 
levels of the problem structure. The performance on other problems is, however, not so influenced 
by Poid, except when p oU is at 100%, where performance on Ptrapfive50 also drops down. Using 
only old clustering hypothesis seems to be harmful for the algorithm. 

Setting the probability of interbreeding - p c - to zero affects the performance of the algorithm 
for a wider range of problems, since this mechanism is responsible for recombination. Increasing 
p c slows down the convergence for all instances tested. 

The parameter p w is also negatively related to the convergence speed, but lower values for p w 
should be avoided, at least for HIFF, since no global solution for Pshuff64 is found for p w = 0. 
The effectiveness on other problems was not affected by this parameter. 

After analyzing all results and recognizing that nearly extreme values for all of the parameters 
tested are, often, undesirable, therefore we set p c , p u and p w all to 50%. 

7 Some benchmark optimization problems 

This section discusses and revises the benchmark problems used in the empirical evaluation Sec- 
tion. The reader is referred to [3] [12] [11] [13] for a more detailed description of the problems. 

A representative set of benchmark problems was chosen, which are known to be hard for a 
GA and most EDAs to solve. The existence of building blocks, or a structure in the problem, 
generally prevents GA and low-order EDAs from achieving success, since genes are assumed 
independent and structure is not preserved. Simple GA with one-point crossover and a proper 
encoding performs well for some structured problems, but the best encoding requires previous 
knowledge about the problem structure, something that is not available in general. 

Three general classes of structured problems are considered. The simpler class is that of ad- 
ditively decomposable functions (ADFs). In ADFs there is no dependence structure among the 
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subproblems since the contribution of each substructure to the overall fitness of the solution is 
independent of the value of the remaining variables, therefore all subproblems can be solved 
independently. The concatenated k-trap function [11], which is better described below, is an ex- 
ample of an ADF with l/k subproblems, where I is the size of the problem and k is the size of each 
subproblem. The highest order of dependence in the concatenated k-trap is, therefore, k. Most 
ADFs proposed in the literature, including trap-5, are deceptive. This means that lower order 
statistics would lead the algorithm away from finding the correct building block, and motivates 
the use of higher order statistics. 

A slight variation of ADFs is obtained when overlapping building blocks exist but the problem 
structure is still defined by additive functions. When building blocks overlap, some genes are 
present in more than one subproblem simultaneously and the problem structure learning task is 
harder. This class of problems should be called overlapping additively decomposable functions 
(OADFs). [13] discusses how to design combination strategies for GA when overlapping building 
blocks occur, in order to minimize the number of building blocks disrupted. 

Another class of GA-hard problems comprises Hierarchically decomposable functions (HDF). 
They were proposed after a criticism to ADFs, which present no dependence structure among the 
subproblems. HDFs, in turn, are much harder to optimize since the substructures interact. The 
fitness contribution of two substructures, when combined, is different from the sum of their indi- 
vidual contributions. One of the most important HDFs is HIFF [12] which presents interactions 
up to the size of the problem. HDFs illustrate a more general class of problems than ADFs and 
demand a more complex mechanism for combination. Substructures of each level must generally 
be first identified and recombined before the next level starts emerging. 

Apart from the taxonomy above, two other important aspects of the benchmark problems re- 
cently approached in the literature should be mentioned. Problems that present several global 
optima, called globally multimodal problems, represent a source of difficulties for GAs and EDAs 
and require efficient niching mechanisms. Also, a class of symmetrical multimodal problems has 
recently been attracting attention. Symmetry occurs when some regularities on the landscape lead 
to the existence of complementary solutions with identical fitness and, consequently, to comple- 
mentary global solutions. Globally multimodal and/or symmetrical problems can be considered 
hard for most evolutionary algorithms because combining solutions from two symmetric local 
optima will probably not be useful to find the global optima. 

Next, the problems used in this work are briefly stated. 



Twomax 

Twomax is a simple function of n binary variables: 



There are two global maxima: (0, 0, . . . , 0) and (1,1,. ..,1), both with fitness equal ton/ '2. 




(9) 



15 



Concatenated trap-5 



The concatenated trap-5 [11] is an additively decomposable function which possess a single global 
optima (the string fulfilled with Is), with a fitness equal to the size of the problem. The input 
string is partitioned into disjoint groups of five bits each. A 5-bit trap function is applied to each 
of the groups and the fitness of the individual is the sum of the contributions of each 5-bit group. 
The algorithm should be able to learn and preserve that structure in order to achieve success. The 
5-bit trap function is: 



where u is sum of the bits in the group. 

For each contiguous non-overlapping block of five variables, two building blocks can be iden- 
tified: (0, 0, 0, 0, 0) and (1, 1, 1, 1, 1), which contribute with 4 and 5 to the overall fitness, respec- 
tively. Single order (or any lower than 5-order) statistics may lead apart from the optimal value 
for each gene, what makes this problem deceptive. 

Notice that several combinations of optimal and suboptimal building blocks may lead to mul- 
tiple suboptimal solutions. Actually, since there are 2 building blocks for a partition size of 5, then 
for a problem instance with size p, there are 2 p / 5 — 1 local optima and only one global optima, 
whose fitness is very close to that of the second best optima. An instance of size 50 (Ptrapfive50) 
is considered later. 



Overlapping concatenated trap-5 

This problem is an overlapping additive decomposable function, with a fixed overlapping length. 
The overlapping scheme is circular as figure 7 illustrates for a problem with 6 BBs. Our test 
instance has 60 genes and overlapping length 2 (therefore 20 overlapping building blocks) called 
Poverfive60, which is also considered in [13]. The circular nature of the overlapping scheme 
with length 2 means that every building block shares 4 genes (2 with each neighbor) and the first 
building block is the neighbor of the last. In figure 7, for instance, building block 1 shares 2 genes 
with each neighbor, building blocks 6 and 2. 

This problem is considered harder than the previous, concatenated trap-5, because in the over- 
lapping version subproblems are not separable. 



Figure 7: A small instance of the overlapping concatenated trap-5 problem, with 6 building blocks. 
Overlapping length is 2. 




(10) 
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Hierarchical if-and-only-if 



HIFF [12] is an hierarchically decomposable function. A binary string of size 2 P represents a 
solution, where p is the number of levels in the hierarchy. The fitness of a solution is given by 

1, if |B| = 1 
|B| + f(B L ) + f(B R ), 

if |B| > 1 and (Vi{bi = 0} or \/i{bi = 1}) 
/(Bl) + /(B_r), otherwise 

where B is a block of bits (bi,..., b\ B \), |B| is the size of the block, bi is the i-th element in block. B L 
and B R are the left and right halves of the block B . The evaluation starts with the chromosome 
as a block. 

Since tight linkage occurs, HIFF is relatively simple for a GA with one-point crossover to solve. 
In the shuffled version, however, the variables are randomly rearranged in order to avoid tight 
linkage. This version is much harder for a GA and some mechanism for detecting the problem 
structure is recommended. An experiment with the shuffled version of HIFF with 64 variables 
(Pshuff64) is described later. For EDAs in general the ordering of the genes in the codification is 
not relevant, since the probabilistic models do not take any advantadge from this information, as 
opposed to GA which actually may depend on an adequate codification. 



fhiff(B) = 



Graph Bisection 

The objective of the graph bisection problem is to obtain two equally sized sets of nodes from 
the original graph such that the number of vertices linking both sets is minimal. The fitness of a 
given solution is the number of nodes minus the number of links between both sets, making it a 
maximization problem. The codification adopted is based on a binary vector of size n where the 
i-th gene represents the label of the i-th node. 

The set of possible solutions is restricted to the condition of equally sized sets; therefore, some 
individuals may appear which are unfeasible. A repair operator was adopted here and in [3] 
which randomly inverts the value of some genes. Iteratively a randomly selected gene in the 
majority is inverted until a feasible solution is obtained. 

All instances of the graph bisection considered in [3] are also studied here. Figure 8 illustrates 
three of those instances: Pgrid36 and Pcatring42 and Pcatring84, which possess 2, 6 and 6 global 
optima, respectively. The other instances considered are Pgridl6, Pgrid64, Pcat28, Pcat42, Pcat56, 
Pcatring28, Pcatring56 and Pcatring84, with the number of global optima ranging from 2 to 6. 



8 Empirical Evaluation 

This Section illustrates the performance of the cg-combination and the </?-PBIL algorithm for a 
representative set of benchmark problems. Those problems summarize most of the hurdles that 
have been reported for benchmark problems for EDAs: deception, multimodallity, overlapping 
building blocks, global multimodallity and symmetry. 
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Pgrid36 



Pcatring42 




Pcatring84 

Figure 8: Graph topologies for Pgrid36, Pcatring42 and Pcatring84. Optimal partitions are illus- 
trated by dashed lines. 

Initially, the two interbreeding mechanisms which were described in section 4 are compared. 
Next, a comparison is performed between tp-PBIL and UEBNA [3], which is a state-of-the-art 
EDA, both solving several instances of the graph bisection problem. 



Comparing two interbreeding mechanisms 

These first experiments illustrates the behavior of the cg-combination, which is compared to the 
PV-wise uniform crossover [22]. This later operator randomly mixes two building blocks during 
an interbreeding, while the cg-combination carefully chooses from which parent to take each 
position of the probability vector, as described in Section 4. In this experiment, the algorithm ip- 
PBIL was ran with its default parameters. Additionally, with k = 4 clusters and population sizes 
are N = N w = 100. 

Before a quantitative comparison, the typical behavior of both operators is illustrated for a 
single run for the concatenated trap-5 problem with size 15. The problem is decomposable, and 
for each subproblem there is an optimal building block. The three optimal building blocks - BBO, 
BB1 and BB2 - must be found and subsequently combined in order to reach the global solution. 

Finding all optimal building blocks is not difficult when the PV-wise uniform crossover is 
used, as illustrated in figure 9; the difficult thing is to mix them. It is noticeable that each clus- 
ter specializes in a specific pattern. Cluster 0, for instance, started attracting some individuals 
possessing BB2. As the evolutionary process continues, the proportion of individuals possessing 
BB2 which are attracted to cluster grows, until around 500 fitness evaluations, when 100% of 
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Figure 9: A single run of ip-FBIL with the PV uniform crossover operator for a problem with 
3 separable building blocks (BBO, BB1 and BB2). The number of fitness evaluations is reported 
in the figure since the process is incremental and there are no explicit "generations". Using PV 
uniform crossover, the algorithm fails on finding the global optima - the combination of BBO, BB1 
andBB2. 
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Figure 10: A single run of t^-PBIL with the cg-combination for a problem with 3 separable building 
blocks (BBO, BB1 and BB2). The number of fitness evaluations is reported in the figure since the 
process is incremental and there are no explicit "generations". The algorithm successfully finds 
the combination of BBO, BB1 and BB2. 
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Table 2: Comparing the concept-guided combination with PV uniform crossover. The same pa- 
rameters and random seeds were used in both sets of runs. A total of 30 runs were performed for 
each algorithm on each problem instance. The fraction of successful runs (Success %), where at 
least one global optima is found, and the mean number of fitness evaluation (Eval.) until conver- 
gence and respective standard deviation (sd.) are reported. 





ip-PBIL using 


nn 1 1 ■ tw t 

^>-PBIL using PV 


• 

problem (size) 


concept-guided 


uniform 




combination 


crossover 




Success % 


Success % 


Shuffled HIFF (128) 


97% 


0% 


Concatenated trap5 (100) 


100% 


0% 


Overlapping c. trap5 (60) 


100% 


0% 


Twomax (100) 


100% 


100% 




<^-PBIL using 


</>PBIL using PV 


problem (size) 


concept-guided 


uniform 




combination 


crossover 




Eval.± sd 


Eval.i sd 


Shuffled HIFF (128) 


105134± 11612 


81646± 13474 


Concatenated trap5 (100) 


90474± 7203 


49391± 7366 


Overlapping c. trap5 (60) 


55649± 4207 


23729± 1048 


Twomax (100) 


4825± 216 


4867± 318 



the individuals of cluster possess BB2. Similarly, cluster 2 specializes on BB0 and cluster 3 on 
BB1. A different setting may occur if another run is performed for a different random seed, but 
the general behavior is similar. The global solution, however, was not found. 

Running </?-PBIL with the cg-combination operator for the same random seed (therefore the 
same initial population) leads to a much better behavior, as figure 10 shows. In the first steps of 
the evolutionary process (roughly until 400 fitness evaluations) speciation is clear. Subsequently, 
as the number of successful combinations grows, clusters attract individuals possessing two or 
even three optimal building blocks simultaneously. Cluster 3, finally, specializes on the global 
optima of the problem after around 2,000 fitness evaluations. At that moment the algorithm 
converges, since all positions of all PVs saturate above 0.95 or below 0.05. It is interesting to note 
that some individuals possessing only one correct building block are still in the population even 
after the combination process is at advanced stages. 

A more quantitative comparison between both interbreeding operators is shown in table 2. 
Four representative problems were chosen: shuffled HIFF, concatenated trap-5, overlapping con- 
catenated trap-5 and twomax, with problem sizes 128, 100, 60 and 100 respectively. Table 2 reports 
the fraction of success (Succ. %), which is the fraction of runs where at least one global optima is 
found, and statistics for the number of fitness evaluation (Eval.) until convergence of </?-PBIL. 

Using the cg-combination, at least one global optimum is found in 97% of the runs for HIFF 
and in 100% of the runs for the other problems. Actually, when this operator is used, all global 
optima were found in 93% of the runs for HIFF and in 100% of the runs for the other problems. 
The PV-wise uniform crossover was not able to reach any global optima for any of the problems 
tested, except for twomax, which is a very simple problem without a complex structure. The 
results are not surprising as it is well known that blind crossover operators that don't respect 
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Table 3: Effectiveness and efficiency of </?-PBIL and UEBNA for 10 instances of the graph bisection 
problem as the mean ± standard deviation of the number of global optima (peaks) found and 
number of fitness evaluations after 10 independent runs of each algorithm for each problem. Data 
for UEBNA extracted from [3]. 



Problem 


EDA 


(Yt"ltl TY1 a — I— <iii F\t^1 — I— Q(H 
V_/ L> 1111 1 fi l Q\ l J_, V C\ 1 . I sil J 


PoriHI 6 

(2 peaks) 


T IFRMA fc — A 

L> XL Ul \ JTV ftj — / -± 


7 0+0 51400+7366 




7 0+0 10176+606 

Z-.U_HJ.VJ _.U__-U__.UVJVJ 


(2 peaks) 


T IFRMA h — 9 

l_J J_j Dl \ __. tv — _j 


7 n_|_n n 8^600+846? 

_-.U_I_ VJ. VJ 0>JUUU_LO _:U_- 


09-PRTT k — .5 


2 0+0 28963+10754 


(2 peaks) 


T IFRMA fc — A 

L> -Cj Dl \ Jrv tv — / -± 


7 0+0 1 74900+3479 


in-vv>w k — 10 


7 0+0 64745+10999 


Pcat28 
(2 peaks) 


TJFBNA fc — 1 


2 0+0 571 00+2846 

__ . \J _l_ v . VJ ^ / lUv 1 — -t_J~VJ 


<p-PBIL fc = 5 


2.0+0.0 14311+1299 


Pcat42 
(2 peaks) 


UEBNA fc = 2 


2.0+0.0 73900+1449 


95-PBIL fc = 10 


2.0+0.0 29714+3644 


Pcat56 
(2 peaks) 


UEBNA fc = 4 


2.0+0.0 96400+2366 


</>PBIL fc = 10 


2.0+0.0 46151+4362 


Pcatring28 
(4 peaks) 


UEBNA fc = 2 


4.0+0.0 54700+949 


<p-PBIL fc = 5 


4.0+0.0 12694+853 


Pcatring56 
(4 peaks) 


UEBNA fc = 8 


3.8+0.4 96400+1897 


V9-PBIL fc = 10 


3.9+0.3 48837+12115 


Pcatring42 
(6 peaks) 


UEBNA fc = 6 


5.9+0.3 75700+3302 


95-PBIL fc = 15 


6.0+0.0 32361+1513 


Pcatring84 
(6 peaks) 


UEBNA fc = 10 


4.8+0.8 121000+3162 


95-PBIL fc = 20 


5.7+0.7 84539+9300 



the structure of the problem are condemned to fail on these types of problems. A very similar 
behavior is observed when no interbreeding is performed, but the results are not reported in the 
table. 

Multimodal optimization and symmetry 

This experiment validates </?-PBIL on multimodal optimization problems, some of which present 
interactions among genes. Those features require good niching and linkage learning capabilities 
from the evolutionary algorithm. A through comparison between 93-PBIL and UEBNA is de- 
scribed. The latter was already shown to be competent to solve globally multimodal optimization 
problems. In [3] UEBNA is compared to other ED As when solving of instances globally multi- 
modal problems. Experiments reported in that work clearly show that UEBNA achieves much 
better efficiency and efficacy when compared to EBNA, which is based on the supervised induc- 
tion of Bayesian networks and does not incorporate cluster labels. 

The graph partitioning problem was used in this comparison. Some instances of this problem 
are hard for EDAs. Particularly, larger "ring" topologies (Pcatring56, Pcatring42 and Pcatring84) 
are challenging because they clearly present a structure of interactions among genes, besides mul- 
timodality and symmetry. In a binary codification of one gene per node there are small groups of 
7 nodes which should be identified and combined until all global solutions are found. Figure 8 
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illustrates some of those instances. 

All results for UEBNA presented here were extracted from [3]. Ten independent runs of each 
algorithm are performed for each problem instance. A fixed set of parameters for t^-PBIL is used 
for all instances: p w = 75%, p id = 25% and p c = 50% (using all 50%, which are the default 
parameters, resulted slightly inferior results). The initial population size = 4, 000 is the same 
for <y9-PBIL and UEBNA. Additionally, the working population size of </?-PBIL is set as N w = 
500. The only free parameter adopted here is k, the number of clusters. The value of k which 
empirically maximizes the performance of the algorithm is chosen, for t^-PBIL, from the interval 
k e [2,3,..., 20] and, for UEBNA, from the results reported in [3]. A similar comparison to 
UEBNA was already reported in [10], but </?-PBIL was run with default parameter set and the 
repair operator for the problem of graph partitioning was not used there. 

Table 3 summarizes the results. The problems and algorithms are organized in rows and 
the data columns show mean ± standard deviation of the number of global optima found (Op- 
tima±s.d.) and for mean ± standard deviation (Eval.±s.d.) of the number of fitness evaluations 
performed until convergence, evaluated over 10 runs. All runs of both algorithms found at least 
one global optima for all instances tested. 

It is clear from results that (/3-PBIL attains convergence for all global optima using less fitness 
evaluations. For smaller problem instances both algorithms find all global optima in all runs 
but, for greater problem instances, </?-PBIL presents better efficiency. The mean number of global 
optima found is better than for UEBNA, noticeably for Pcatring56, Pcatring42 and Pcatring84. For 
the later, which presents 6 global optima, t^-PBIL found 5.7 ± 0.7 global optima, while UEBNA 
found 4.8 ± 0.8 global optima. This may suggest that i^-PBIL presents a good scalability behavior. 
Further scalability tests were also shown in [10]. 

</?-PBIL can attain good results with relatively small populations. It is also important to note 
that computational complexity of the learning step of </?-PBIL is related to a k-means update, what 
is less time-consuming than learning Bayesian networks. Howerver, when comparing computa- 
tional complexity of </?-PBIL to other algorthms one should consider that the model is is updated 
after the introduction of a single individual, while most algorithms learn a model for each new 
population. 

9 Conclusion 

This work presents </?-PBIL: an evolutionary algorithm based on the application of an interbreed- 
ing operator, which is guided by the information obtained by clustering the population of an EDA. 
Identification of substructures is performed by clustering and combination among substructures 
can be achieved by combining information from different clusters. The core algorithm is a sim- 
ple clustered EDA that adopts order-2 probabilistic models, which leads to a very parsimonious 
approach when compared to other competent EDAs. Its simple incremental approach is also 
noticeable, where individuals are successively generated from continuously updated binomial 
probabilistic models. 

Probabilistic models used by t^-PBIL are much simpler to infer when compared to Bayesian 
networks, which are currently the most effective models for EDAs. Actually, c^-PBIL keeps proba- 
bilistic models updated after each new individual is generated and selected at the cost of a single 
k-means update. 

The most sophisticated part of the algorithm is the interbreeding mechanism, which respects 
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the problem structure and relies on the diversity of those substructures as one of the causes for 
the existence of clusters in the population. Experiments show some evidence that this mechanism 
is very important for solving problems that present interactions among the variables. Literature 
generally recommends avoiding the combination from different clusters, since this is often un- 
productive. (p-PBIL attempts to minimize bad results by carefully choosing how to combine the 
most informative portions. 

An empirical search for default values for some parameters was performed, but a more formal 
investigation should also be done. Furthermore, it is important to understand how to set the 
population size and number of clusters for each kind and size of problem. Some work on this 
topic has already been done for other evolutionary algorithms [26] and the methodology could 
be adapted for our algorithm. The relation between some of the parameters should also be better 
explored. For instance, the number k of clusters should be related to the population size N w . 
Actually, some empirical investigation (not shown) has pointed out that N w /k should be of the 
order of magnitude of 10 1 . Actually N w /k = 20 or 30 was enough for most problems verified. 

Future work will check for relevant features, like scalability for a wider range of problems in- 
cluding HDFs like HIFF [12]. Further empirical investigation on the behavior of the interbreeding 
mechanism should also be performed in order to better explain how exploration is actually done 
for several classes of problems. 
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